INTERSPEECH.2022 - Speech Recognition

Total: 265

#1 Regularizing Transformer-based Acoustic Models by Penalizing Attention Weights

Authors: Munhak Lee ; Joon-Hyuk Chang ; Sang-Eon Lee ; Ju-Seok Seong ; Chanhee Park ; Haeyoung Kwon

The application of deep learning has significantly advanced the performance of automatic speech recognition (ASR) systems. An ASR system comprises various components, such as the acoustic model (AM), language model, and lexicon. Generally, the AM has benefited the most from deep learning. Numerous types of neural network-based AMs have been studied, but the structure that has received the most attention in recent years is the Transformer. In this study, we demonstrate that the Transformer model is more vulnerable to input sparsity than convolutional neural networks (CNNs) and analyze the cause of this performance degradation in terms of the Transformer's structural characteristics. Moreover, we propose a novel regularization method that makes the Transformer model robust against input sparsity. The proposed sparsity regularization method directly regulates attention weights using silence label information from forced alignment and has the advantage of requiring neither additional module training nor excessive computation. We tested the proposed method on five benchmarks and observed an average relative error rate reduction (RERR) of 4.7%.
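
The core idea, a loss term that discourages attention heads from attending to frames labeled as silence by a forced alignment, can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the tensor layout and the weighting factor `lambda_silence` are assumptions.

```python
# Minimal sketch (not the authors' code): penalize attention mass that falls on
# frames marked as silence by a forced alignment. `attn` has shape
# (batch, heads, query_len, key_len); `silence_mask` is 1.0 on silent frames.
import torch

def silence_attention_penalty(attn: torch.Tensor,
                              silence_mask: torch.Tensor) -> torch.Tensor:
    # Broadcast the mask over heads and query positions.
    mask = silence_mask[:, None, None, :]          # (batch, 1, 1, key_len)
    # Average attention weight assigned to silence keys.
    return (attn * mask).sum(dim=-1).mean()

# total_loss = asr_loss + lambda_silence * silence_attention_penalty(attn, mask)
```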

#2 Content-Context Factorized Representations for Automated Speech Recognition

Authors: David Chan ; Shalini Ghosh

Deep neural networks have largely demonstrated their ability to perform automated speech recognition (ASR) by extracting meaningful features from input audio frames. Such features, however, may encode not only the spoken language content but also unnecessary context, such as background noise and sounds, speaker identity, accent, or protected attributes. Such information can directly harm generalization performance by introducing spurious correlations between the spoken words and the context in which they were spoken. In this work, we introduce an unsupervised, encoder-agnostic method for factoring speech-encoder representations into explicit content-encoding representations and spurious context-encoding representations. By doing so, we demonstrate improved performance on standard ASR benchmarks, as well as improved performance in both real-world and artificially noisy ASR scenarios.

#3 Comparison and Analysis of New Curriculum Criteria for End-to-End ASR

Authors: Georgios Karakasidis ; Tamás Grósz ; Mikko Kurimo

It is common knowledge that the quantity and quality of training data play a significant role in the creation of a good machine learning model. In this paper, we take this one step further and demonstrate that the way the training examples are arranged is also of crucial importance. Curriculum Learning is built on the observation that organized and structured assimilation of knowledge enables faster training and better comprehension. When humans learn to speak, they first try to utter basic phones and then gradually move towards more complex structures such as words and sentences. We employ this methodology in the context of Automatic Speech Recognition. We hypothesize that end-to-end models can achieve better performance when provided with an organized training set consisting of examples that exhibit an increasing level of difficulty (i.e., a curriculum). To impose structure on the training set and to define the notion of an easy example, we explored multiple scoring functions that either use feedback from an external neural network or incorporate feedback from the model itself. Empirical results show that with different curricula we can balance training time against the network's performance.
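
As a rough illustration of the curriculum idea, the sketch below orders training examples by a difficulty score before training, easiest first. The scoring function shown (utterance length) is only a placeholder; the paper explores scores derived from an external network or from the model itself.

```python
# Illustrative sketch of one possible curriculum criterion (not the paper's
# exact setup): order utterances by a per-example difficulty score, easiest
# first. The hypothetical `score_fn` could be the loss of an external scoring
# model or of the model being trained itself.
def build_curriculum(dataset, score_fn):
    scored = [(score_fn(example), example) for example in dataset]
    scored.sort(key=lambda pair: pair[0])          # low score = easy example
    return [example for _, example in scored]

# easy_to_hard = build_curriculum(train_set, score_fn=lambda ex: len(ex["text"]))
```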

#4 Incremental learning for RNN-Transducer based speech recognition models

Authors: Deepak Baby ; Pasquale D'Alterio ; Valentin Mendelev

This paper investigates an incremental learning framework for a real-world voice assistant employing an RNN-Transducer based automatic speech recognition (ASR) model. Such a model needs to be regularly updated to keep up with the changing distribution of customer requests. We demonstrate that a simple fine-tuning approach with a combination of old and new training data can be used to incrementally update the model, spending only several hours of training time and causing no degradation on old data. This paper explores multiple rounds of incremental updates on the ASR model with monthly training data. Results show that the proposed approach achieves 5-6% relative WER improvement over models trained from scratch on the monthly evaluation datasets. In addition, we explore whether it is possible to improve recognition of specific new words. We simulate multiple rounds of incremental updates with a handful of training utterances per word (both real and synthetic) and show that recognition of the new words improves dramatically, but with a minor degradation on general data. Finally, we demonstrate that the observed degradation on general data can be mitigated by interleaving monthly updates with updates targeting specific words.
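
A hedged sketch of the data-mixing step described above: combine the new month's data with a replay subset of older data before fine-tuning. The sampling ratio and data layout are assumptions, not details from the paper.

```python
# Sketch of mixing old and new training data for an incremental update
# (the 50% replay fraction is an assumption used only for illustration).
import random

def build_update_set(old_data, new_data, old_fraction=0.5, seed=0):
    rng = random.Random(seed)
    n_old = int(len(new_data) * old_fraction)
    replay = rng.sample(old_data, min(n_old, len(old_data)))
    return replay + list(new_data)

# update_set = build_update_set(previous_months, current_month)
# fine_tune(model, update_set)   # hypothetical training call
```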

#5 Production federated keyword spotting via distillation, filtering, and joint federated-centralized training

Authors: Andrew Hard ; Kurt Partridge ; Neng Chen ; Sean Augenstein ; Aishanee Shah ; Hyun Jin Park ; Alex Park ; Sara Ng ; Jessica Nguyen ; Ignacio Lopez-Moreno ; Rajiv Mathews ; Francoise Beaufays

We trained a keyword spotting model using federated learning on real user devices and observed significant improvements when the model was deployed for inference on phones. To compensate for data domains that are missing from on-device training caches, we employed joint federated-centralized training. And to learn in the absence of curated labels on-device, we formulated a confidence filtering strategy based on user-feedback signals for federated distillation. These techniques created models that significantly improved quality metrics in offline evaluations and user-experience metrics in live A/B experiments.
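
The confidence-filtering step can be pictured roughly as below; the threshold, field names, and the way user feedback is encoded are assumptions for illustration only.

```python
# Minimal sketch of confidence-based filtering for on-device distillation:
# keep only utterances whose teacher confidence is high enough, optionally
# gated by a user-feedback signal. Field names are hypothetical.
def filter_for_distillation(examples, min_confidence=0.8):
    kept = []
    for ex in examples:
        confident = ex["teacher_confidence"] >= min_confidence
        accepted = ex.get("user_accepted", True)   # e.g., query not repeated/corrected
        if confident and accepted:
            kept.append(ex)
    return kept
```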

#6 Multi-Corpus Speech Emotion Recognition for Unseen Corpus Using Corpus-Wise Weights in Classification Loss

Authors: Youngdo Ahn ; Sung Joo Lee ; Jong Won Shin

Since each of the currently available emotional speech corpora is too small to cover personal or cultural diversity, multiple emotional speech corpora can be jointly used to train a speech emotion recognition (SER) model robust to unseen corpora. Each corpus has different characteristics, including whether the speech is acted or spontaneous, in which environment it was recorded, and what lexical content it contains. Depending on these characteristics, the emotion recognition accuracy and the time required to train a model on each corpus differ. If we train the SER model utilizing multiple corpora equally, the classification performance for each training corpus would be different. The performance for unseen corpora may be enhanced if the model is trained to show similar recognition accuracy for each training corpus, as the training corpora cover different characteristics. In this study, we propose to adopt corpus-wise weights in the classification loss, which are functions of the recognition accuracy for each training corpus. We also adopt pseudo-emotion labels for the unlabeled speech corpus to further enhance the performance. Experimental results showed that the proposed method outperformed previously proposed approaches in out-of-corpus SER using three emotional corpora for training and one corpus for evaluation.
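
One way to picture corpus-wise loss weighting is the sketch below, where each sample's loss is scaled by a weight derived from the current recognition accuracy of its corpus. The specific weighting function is an assumption; the paper defines its own functions of per-corpus accuracy.

```python
# Sketch of a corpus-weighted cross-entropy (the exact weighting function in
# the paper may differ; here weights are simply inverse to per-corpus accuracy).
import torch
import torch.nn.functional as F

def corpus_weighted_ce(logits, labels, corpus_ids, corpus_accuracy):
    # corpus_accuracy: dict mapping corpus id -> current recognition accuracy in [0, 1]
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    weights = torch.tensor(
        [1.0 - corpus_accuracy[int(c)] + 1e-3 for c in corpus_ids],
        device=logits.device,
    )
    return (weights * per_sample).mean()
```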

#7 Improving Speech Emotion Recognition Through Focus and Calibration Attention Mechanisms

Authors: Junghun Kim ; Yoojin An ; Jihie Kim

Attention has become one of the most commonly used mechanisms in deep learning approaches. The attention mechanism can help the system focus more on the critical regions of the feature space. For example, high-amplitude regions can play an important role in Speech Emotion Recognition (SER). In this paper, we identify misalignments between the attention and the signal amplitude in the existing multi-head self-attention. To improve the attended regions, we propose a Focus-Attention (FA) mechanism and a novel Calibration-Attention (CA) mechanism used in combination with multi-head self-attention. Through the FA mechanism, the network can detect the largest-amplitude part of a segment. By employing the CA mechanism, the network can modulate the information flow by assigning different weights to each attention head and improve the utilization of surrounding contexts. To evaluate the proposed method, experiments are performed on the IEMOCAP and RAVDESS datasets. Experimental results show that the proposed framework significantly outperforms state-of-the-art approaches on both datasets.

#8 The Emotion is Not One-hot Encoding: Learning with Grayscale Label for Emotion Recognition in Conversation

Author: Joosung Lee

In emotion recognition in conversation (ERC), the emotion of the current utterance is predicted by considering the previous context, which can be utilized in many natural language processing tasks. Although multiple emotions can coexist in a given sentence, most previous approaches treat the task as a classification problem and predict only the given label. However, it is expensive and difficult to label the emotion of a sentence with confidence scores or multiple labels. In this paper, we automatically construct a grayscale label considering the correlation between emotions and use it for learning. That is, instead of using a given label as a one-hot encoding, we construct a grayscale label by measuring scores for different emotions. We introduce several methods for constructing grayscale labels and confirm that each method improves emotion recognition performance. Our method is simple, effective, and universally applicable to previous systems. The experiments show a significant improvement in the performance of baselines.
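
A minimal sketch of the grayscale-label idea, assuming the scores come from an emotion-correlation matrix (one of several constructions the paper discusses):

```python
# Illustrative sketch: mix the one-hot target with correlation-based scores to
# obtain a soft ("grayscale") label, then train with a soft-target loss.
# `corr_matrix` and `alpha` are assumptions for this example.
import torch
import torch.nn.functional as F

def grayscale_label(gold_idx, corr_matrix, num_classes, alpha=0.5):
    one_hot = F.one_hot(torch.tensor(gold_idx), num_classes).float()
    corr = corr_matrix[gold_idx]                   # similarity to other emotions
    soft = alpha * one_hot + (1 - alpha) * corr / corr.sum()
    return soft / soft.sum()

def soft_target_loss(logits, soft_targets):
    # Cross-entropy against soft targets instead of a one-hot label.
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```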

#9 Probing speech emotion recognition transformers for linguistic knowledge

Authors: Andreas Triantafyllopoulos ; Johannes Wagner ; Hagen Wierstorf ; Maximilian Schmitt ; Uwe Reichel ; Florian Eyben ; Felix Burkhardt ; Björn W. Schuller

Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in a self-supervised manner with the goal of improving automatic speech recognition performance -- and thus, of understanding linguistic information. In this work, we investigate the extent to which this information is exploited during SER fine-tuning. Using a reproducible methodology based on open-source tools, we synthesise prosodically neutral speech utterances while varying the sentiment of the text. Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as negations, but not to intensifiers or reducers, while none of those linguistic features impact arousal or dominance. These findings show that transformers can successfully leverage linguistic information to improve their valence predictions, and that linguistic analysis should be included in their testing.

#10 End-To-End Label Uncertainty Modeling for Speech-based Arousal Recognition Using Bayesian Neural Networks

Authors: Navin Raj Prabhu ; Guillaume Carbajal ; Nale Lehmann-Willenbrock ; Timo Gerkmann

Emotions are subjective constructs. Recent end-to-end speech emotion recognition systems are typically agnostic to the subjective nature of emotions, despite their state-of-the-art performance. In this work, we introduce an end-to-end Bayesian neural network architecture to capture the inherent subjectivity in the arousal dimension of emotional expressions. To the best of our knowledge, this work is the first to use Bayesian neural networks for speech emotion recognition. At training time, the network learns a distribution of weights to capture the inherent uncertainty related to subjective arousal annotations. To this end, we introduce a loss term that enables the model to be explicitly trained on a distribution of annotations, rather than exclusively on mean or gold-standard labels. We evaluate the proposed approach on the AVEC'16 dataset. Qualitative and quantitative analysis of the results reveals that the proposed model can aptly capture the distribution of subjective arousal annotations, with state-of-the-art results in mean and standard deviation estimations for uncertainty modeling.

#11 Mind the gap: On the value of silence representations to lexical-based speech emotion recognition

Authors: Matthew Perez ; Mimansa Jaiswal ; Minxue Niu ; Cristina Gorrostieta ; Matthew Roddy ; Kye Taylor ; Reza Lotfian ; John Kane ; Emily Mower Provost

Speech timing and non-speech regions (here referred to as "silence") often play a critical role in the perception of spoken language. Silence represents an important paralinguistic component in communication. For example, some of its functions include conveying emphasis, dramatization, or even sarcasm. In speech emotion recognition (SER), there has been relatively little work investigating the utility of silence and no work regarding the effect of silence on linguistic modeling. In this work, we present a novel framework that investigates fusing linguistic and silence representations for emotion recognition in naturalistic speech using the MSP-Podcast dataset. We investigate two methods to represent silence in SER models: the first approach uses utterance-level statistics, while the second learns a silence token embedding within a transformer language model. Our results show that modeling silence does improve SER performance and that modeling silence as a token in a transformer language model significantly improves performance on MSP-Podcast, achieving a concordance correlation coefficient of .191 and .453 for activation and valence, respectively. In addition, we perform analyses on the attention given to silence and find that silence emphasizes the attention of its surrounding words.

#12 Exploiting Co-occurrence Frequency of Emotions in Perceptual Evaluations To Train A Speech Emotion Classifier

Authors: Huang-Cheng Chou ; Chi-Chun Lee ; Carlos Busso

Previous studies on speech emotion recognition (SER) with categorical emotions have often formulated the task as a single-label classification problem, where the emotions are considered orthogonal to each other. However, previous studies have indicated that emotions can co-occur, especially for more ambiguous emotional sentences (e.g., a mixture of happiness and surprise). Some studies have regarded SER as a multi-label task, predicting multiple emotional classes. However, this formulation does not leverage the relation between emotions during training, since emotions are assumed to be independent. This study explores the idea that emotional classes are not necessarily independent and its implications for training SER models. In particular, we calculate the frequency of co-occurring emotions from perceptual evaluations in the train set to generate a matrix with class-dependent penalties, penalizing mistakes between distant emotional classes more heavily. We integrate the penalization matrix into three existing label-learning approaches (hard-label, multi-label, and distribution-label learning) using the proposed modified loss. We train SER models using the penalty loss and commonly used cost functions for SER tasks. The evaluation of our proposed penalization matrix on the MSP-Podcast corpus shows important relative improvements in macro F1-score for hard-label learning (17.12%), multi-label learning (12.79%), and distribution-label learning (25.8%).
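
The following sketch illustrates one way such a penalization matrix could be built from co-occurrence counts and folded into the loss; the exact form used in the paper may differ.

```python
# Hedged sketch of a class-dependent penalty derived from co-occurrence counts:
# emotions that rarely co-occur with the target receive a larger penalty weight,
# and probability mass placed on them is penalized on top of the usual loss.
import torch
import torch.nn.functional as F

def cooccurrence_penalty_matrix(cooc_counts: torch.Tensor) -> torch.Tensor:
    probs = cooc_counts / cooc_counts.sum(dim=1, keepdim=True)
    return 1.0 - probs                             # distant classes -> weight near 1

def penalized_loss(logits, target, penalty):
    probs = F.softmax(logits, dim=-1)
    # Weight the probability mass placed on each class by its penalty w.r.t. the target.
    per_class = penalty[target] * probs            # (batch, num_classes)
    return F.cross_entropy(logits, target) + per_class.sum(dim=-1).mean()
```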

#13 Positional Encoding for Capturing Modality Specific Cadence for Emotion Detection

Authors: Hira Dhamyal ; Bhiksha Raj ; Rita Singh

Emotion detection from a single modality, such as an audio or text stream, has been known to be a challenging task. While encouraging results have been obtained by using joint evidence from multiple streams, combining such evidence in optimal ways is an open challenge. In this paper, we claim that although modalities like audio, phoneme sequence IDs, and word sequence IDs are related to each other, they also have their individual local "cadence", which is important to model for the task of emotion recognition. We model the local cadence by using separate positional encodings for each modality in a transformer architecture. Our results show that emotion detection based on this strategy is better than when the modality-specific cadence is ignored or normalized out by using a shared positional encoding. We also find that capturing the modality interdependence is not as important as capturing the local cadence of individual modalities. We conduct our experiments on the IEMOCAP and CMU-MOSI datasets to demonstrate the effectiveness of the proposed methodology for combining multi-modal evidence.
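
A minimal sketch of modality-specific positional encodings, assuming learnable per-modality tables and simple concatenation before a shared transformer (the fusion details are assumptions):

```python
# Sketch: each input stream gets its own positional table before the sequences
# are concatenated and fed to a shared transformer. Dimensions are illustrative.
import torch
import torch.nn as nn

class ModalityPositionalEncoding(nn.Module):
    def __init__(self, d_model=256, max_len=512,
                 modalities=("audio", "phoneme", "word")):
        super().__init__()
        self.pos = nn.ParameterDict({
            m: nn.Parameter(torch.zeros(max_len, d_model)) for m in modalities
        })

    def forward(self, features: dict) -> torch.Tensor:
        # features: modality name -> tensor of shape (batch, seq_len, d_model)
        encoded = [x + self.pos[m][: x.size(1)] for m, x in features.items()]
        return torch.cat(encoded, dim=1)           # fuse along the time axis
```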

#14 Binary Early-Exit Network for Adaptive Inference on Low-Resource Devices

Author: Aaqib Saeed

Deep neural networks have significantly improved performance on a range of tasks, but their increasing demand for computational resources leaves deployment on low-resource devices (with limited memory and battery power) infeasible. Binary neural networks (BNNs) tackle the issue to an extent, with extreme compression and speed-up gains compared to real-valued models. We propose a simple but effective method to accelerate inference by unifying BNNs with an early-exiting strategy. Our approach allows simple instances to exit early based on a decision threshold and utilizes output layers added to different intermediate layers to avoid executing the entire binary model. We extensively evaluate our method on three audio classification tasks and across four BNN architectures. Our method demonstrates favorable quality-efficiency trade-offs while being controllable with an entropy-based threshold specified by the system user. It achieves better speed-ups (latency less than 6 ms) with a single model based on existing BNN architectures, without retraining for different efficiency levels. It also provides a straightforward way to estimate sample difficulty and a better understanding of uncertainty around certain classes within the dataset.
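
The entropy-gated early-exit loop can be sketched as follows; `blocks` and `exits` stand in for the binary backbone stages and their attached classifiers, and single-utterance inference is assumed.

```python
# Sketch of entropy-based early exiting (the BNN internals are omitted): stop
# at the first intermediate classifier whose prediction entropy falls below
# the user-specified threshold.
import torch
import torch.nn.functional as F

def early_exit_predict(x, blocks, exits, entropy_threshold=0.3):
    h = x
    probs = None
    for block, exit_head in zip(blocks, exits):
        h = block(h)
        probs = F.softmax(exit_head(h), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
        if entropy.item() < entropy_threshold:     # confident enough: exit early
            return probs
    return probs                                   # no exit fired; use final classifier
```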

#15 Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

Authors: Naoyuki Kanda ; Jian Wu ; Yu Wu ; Xiong Xiao ; Zhong Meng ; Xiaofei Wang ; Yashesh Gaur ; Zhuo Chen ; Jinyu Li ; Takuya Yoshioka

This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what" with low latency even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities, we propose an encoder-decoder based speaker embedding extractor that can estimate a speaker representation for each recognized token not only from non-overlapping speech but also from overlapping speech. The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription with low latency. We evaluate the proposed model for a joint task of ASR and SID/SD by using LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.

#16 Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data

Authors: Naoki Makishima ; Satoshi Suzuki ; Atsushi Ando ; Ryo Masumura

In this paper, we investigate the semi-supervised joint training of text-to-speech (TTS) and automatic speech recognition (ASR), where a small amount of paired data and a large amount of unpaired text data are available. Conventional studies form a cycle called the TTS-ASR pipeline, where the multi-speaker TTS model synthesizes speech from text with a reference speech and the ASR model reconstructs the text from the synthesized speech, after which both models are trained with a cycle-consistency loss. However, the synthesized speech does not reflect the speaker characteristics of the reference speech, and it becomes overly easy for the ASR model to recognize after training. This not only decreases the TTS model quality but also limits the ASR model improvement. To solve this problem, we propose improving the cycle-consistency-based training with a speaker consistency loss and step-wise optimization. The speaker consistency loss brings the speaker characteristics of the synthesized speech closer to those of the reference speech. In the step-wise optimization, we first freeze the parameters of the TTS model before both models are trained, to avoid over-adaptation of the TTS model to the ASR model. Experimental results demonstrate the efficacy of the proposed method.
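
A minimal sketch of a speaker consistency loss of this kind, assuming a pretrained speaker encoder and cosine distance (the paper's exact formulation may differ):

```python
# Sketch: pull the synthesized speech's speaker embedding toward that of the
# reference speech. `speaker_encoder` is an assumed, pretrained embedding model.
import torch
import torch.nn.functional as F

def speaker_consistency_loss(speaker_encoder, ref_speech, synth_speech):
    e_ref = speaker_encoder(ref_speech)            # (batch, embed_dim)
    e_syn = speaker_encoder(synth_speech)
    return (1.0 - F.cosine_similarity(e_syn, e_ref, dim=-1)).mean()

# total_loss = cycle_consistency_loss + lambda_spk * speaker_consistency_loss(...)
```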

#17 Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation

Authors: Yi-Kai Zhang ; Da-Wei Zhou ; Han-Jia Ye ; De-Chuan Zhan

Although deep learning-based audio-visual speech recognition (AVSR) systems recognize base closed-set categories well, extending their discerning ability to additional novel categories with limited labeled training data is challenging, since the model easily overfits. In this paper, we propose Prototype-based Co-Adaptation with Transformer (Proto-CAT), a multi-modal generalized few-shot learning (GFSL) method for AVSR systems. In other words, Proto-CAT learns to recognize novel multi-modal classes from few-shot training data while maintaining its ability on the base closed-set categories. The main idea is to transform the prototypes (i.e., class centers) by incorporating cross-modality complementary information and calibrating cross-category semantic differences. In particular, Proto-CAT co-adapts the embeddings from the audio-visual and category levels, so that it generalizes its predictions to all categories dynamically. Proto-CAT achieves state-of-the-art performance on various AVSR-GFSL benchmarks. The code is available at https://github.com/ZhangYikaii/Proto-CAT.

#18 Federated Domain Adaptation for ASR with Full Self-Supervision

Authors: Junteng Jia ; Jay Mahadeokar ; Weiyi Zheng ; Yuan Shangguan ; Ozlem Kalinli ; Frank Seide

Cross-device federated learning (FL) protects user privacy by collaboratively training a model on user devices, thereby eliminating the need for collecting, storing, and manually labeling user data. Previous works have considered cross-device FL for automatic speech recognition (ASR); however, a few important challenges have not been fully addressed. These include the lack of ground-truth ASR transcriptions and the scarcity of compute resources and network bandwidth on edge devices. In this paper, we address these two challenges. First, we propose a federated learning system to support on-device ASR adaptation with full self-supervision, which uses self-labeling together with data augmentation and filtering techniques. The proposed system can improve a strong Emformer-Transducer based ASR model pretrained on out-of-domain data, using in-domain audio without any ground-truth transcriptions. Second, to reduce the training cost, we propose a self-restricted RNN Transducer (SR-RNN-T) loss, a new variant of alignment-restricted RNN-T that uses Viterbi forced alignment from self-supervision. To further reduce the compute and network cost, we systematically explore adapting only a subset of weights in the Emformer-Transducer. Our best training recipe achieves a 12.9% relative WER reduction over the strong out-of-domain baseline, which equals 70% of the reduction achievable with full human supervision and centralized training.

#19 Augmented Adversarial Self-Supervised Learning for Early-Stage Alzheimer's Speech Detection

Authors: Longfei Yang ; Wenqing Wei ; Sheng Li ; Jiyi Li ; Takahiro Shinozaki

The early-stage detection of Alzheimer's disease has been considered an important field of medical studies. While speech-based automatic detection methods have attracted attention in the community, traditional machine learning methods suffer from data shortage because Alzheimer's record data is very difficult to obtain from medical institutions. To address this problem, this study proposes an augmented adversarial self-supervised learning method for Alzheimer's disease detection using limited speech data. In our approach, Alzheimer-like patterns are captured through an augmented adversarial self-supervised framework, which is trained in an adversarial manner using limited Alzheimer's data together with a large amount of easily collected normal speech data and an augmented set of Alzheimer's data. Experimental results show that our model can effectively handle the data sparsity problem and outperforms several baselines by a large margin. The performance for the "AD" class is improved significantly, which is very important for actual AD detection applications.

#20 Extending RNN-T-based speech recognition systems with emotion and language classification

Authors: Zvi Kons ; Hagai Aronowitz ; Edmilson Morais ; Matheus Damasceno ; Hong-Kwang Kuo ; Samuel Thomas ; George Saon

Speech transcription, emotion recognition, and language identification are usually considered to be three different tasks. Each one requires a different model with a different architecture and training process. We propose using a recurrent neural network transducer (RNN-T)-based speech-to-text (STT) system as a common component that can be used for emotion recognition and language identification as well as for speech recognition. Our work extends the STT system for emotion classification through minimal changes, and shows successful results on the IEMOCAP and MELD datasets. In addition, we demonstrate that by adding a lightweight component to the RNN-T module, it can also be used for language identification. In our evaluations, this new classifier demonstrates state-of-the-art accuracy for the NIST-LRE-07 dataset.

#21 Thutmose Tagger: Single-pass neural model for Inverse Text Normalization

Authors: Alexandra Antonova ; Evelina Bakhturina ; Boris Ginsburg

Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition (ASR). It converts numbers, dates, abbreviations, and other semiotic classes from the spoken form generated by ASR to their written forms. One can consider ITN a machine translation task and use neural sequence-to-sequence models to solve it. Unfortunately, such neural models are prone to hallucinations that could lead to unacceptable errors. To mitigate this issue, we propose a single-pass token classifier model that regards ITN as a tagging task. The model assigns a replacement fragment to every input token or marks it for deletion or copying without changes. We present a method of dataset preparation based on granular alignment of ITN examples. The proposed model is less prone to hallucination errors. The model is trained on the Google Text Normalization dataset and achieves state-of-the-art sentence accuracy on both English and Russian test sets. One-to-one correspondence between tags and input words improves the interpretability of the model's predictions, simplifies debugging, and allows for post-processing corrections. The model is simpler than sequence-to-sequence models and easier to optimize in production settings. The model and the code to prepare the dataset are published as part of the NeMo project.
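
Applying the predicted tags is a simple single pass over the input tokens, roughly as sketched below; the tag names are illustrative, not NeMo's actual label set.

```python
# Illustrative sketch of applying single-pass ITN tags: each input token is
# kept, deleted, or replaced by the written-form fragment its tag carries.
def apply_itn_tags(tokens, tags):
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "<KEEP>":
            out.append(token)
        elif tag == "<DELETE>":
            continue
        else:                                      # tag carries a replacement fragment
            out.append(tag)
    return " ".join(out)

# apply_itn_tags(["twenty", "twenty", "two"], ["2022", "<DELETE>", "<DELETE>"])
# -> "2022"
```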

#22 Leveraging Prosody for Punctuation Prediction of Spontaneous Speech

Authors: Yeonjin Cho ; Sara Ng ; Trang Tran ; Mari Ostendorf

This paper introduces a new neural model for punctuation prediction that incorporates prosodic features to improve automatic punctuation prediction in transcriptions of spontaneous speech. We explore the benefit of intonation and energy features over simply using pauses. In addition, the work poses the question of how to represent interruption points associated with disfluencies in spontaneous speech. In experiments on the Switchboard corpus, we find that prosodic information improved punctuation prediction fidelity for both hand transcripts and ASR output. Explicit modeling of interruption points can benefit prediction of standard punctuation, particularly if the convention associates interruptions with commas.

#23 A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings

Authors: Fan Yu ; Zhihao Du ; ShiLiang Zhang ; Yuxiao Lin ; Lei Xie

In this paper, we conduct a comparative study on speaker-attributed automatic speech recognition (SA-ASR) in the multi-party meeting scenario, a topic with increasing attention in meeting rich transcription. Specifically, three approaches are evaluated in this study. The first approach, FD-SOT, consists of a frame-level diarization model to identify speakers and a multi-talker ASR to recognize utterances. The speaker-attributed transcriptions are obtained by aligning the diarization results and recognized hypotheses. However, such an alignment strategy may suffer from erroneous timestamps due to the modular independence, severely hindering the model performance. Therefore, we propose the second approach, WD-SOT, to address alignment errors by introducing a word-level diarization model, which can get rid of such timestamp alignment dependency. To further mitigate the alignment issues, we propose the third approach, TS-ASR, which trains a target-speaker separation module and an ASR module jointly. By comparing various strategies for each SA-ASR approach, experimental results on a real meeting scenario corpus, AliMeeting, reveal that the WD-SOT approach achieves 10.7% relative reduction on averaged speaker-dependent character error rate (SD-CER), compared with the FD-SOT approach. In addition, the TS-ASR approach also outperforms the FD-SOT approach and brings 16.5% relative average SD-CER reduction.

#24 Space-Efficient Representation of Entity-centric Query Language Models

Authors: Christophe Van Gysel ; Mirko Hannemann ; Ernest Pusateri ; Youssef Oualil ; Ilya Oparin

Virtual assistants make use of automatic speech recognition (ASR) to help users answer entity-centric queries. However, spoken entity recognition is a difficult problem, due to the large number of frequently-changing named entities. In addition, resources available for recognition are constrained when ASR is performed on-device. In this work, we investigate the use of probabilistic grammars as language models within the finite-state transducer (FST) framework. We introduce a deterministic approximation to probabilistic grammars that avoids the explicit expansion of non-terminals at model creation time, integrates directly with the FST framework, and is complementary to n-gram models. We obtain a 10% relative word error rate improvement on long tail entity queries compared to when a similarly-sized n-gram model is used without our method.

#25 Domain Prompts: Towards memory and compute efficient domain adaptation of ASR systems

Authors: Saket Dingliwal ; Ashish Shenoy ; Sravan Bodapati ; Ankur Gandhe ; Ravi Teja Gadde ; Katrin Kirchhoff

Automatic Speech Recognition (ASR) systems have found use in numerous industrial applications in very diverse domains, creating a need to adapt to new domains with small memory and deployment overhead. In this work, we introduce domain-prompts, a methodology that involves training a small number of domain embedding parameters to prime a Transformer-based Language Model (LM) to a particular domain. Using this domain-adapted LM for rescoring ASR hypotheses can achieve 7-13% WER reduction for a new domain with just 1000 unlabeled, domain-specific text sentences. This improvement is comparable to, or even better than, that of fully fine-tuned models, even though just 0.02% of the parameters of the base LM are updated. Additionally, our method is deployment-friendly, as the learnt domain embeddings are prefixed to the input of the model rather than changing the base model architecture. Therefore, our method is an ideal choice for on-the-fly adaptation of LMs used in ASR systems to progressively scale them to new domains.
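
A minimal sketch of the prompt-prefixing idea, assuming learnable prompt vectors prepended to the token embeddings of a frozen base LM (prompt length and dimensions are illustrative):

```python
# Sketch: a small set of trainable embeddings is prepended to the token
# embeddings of a frozen Transformer LM; only these prompt parameters are updated.
import torch
import torch.nn as nn

class DomainPrompt(nn.Module):
    def __init__(self, prompt_len=20, d_model=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model) from the frozen base LM
        batch = token_embeddings.size(0)
        prefix = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeddings], dim=1)
```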